Classifying articles in English and German Wikipedia
Authors
Abstract
Named Entity (NE) information is critical for Information Extraction (IE) tasks. However, the cost of manually annotating sufficient data for training purposes, especially for multiple languages, is prohibitive, meaning automated methods for developing resources are crucial. We investigate the automatic generation of NE annotated data in German from Wikipedia. By incorporating structural features of Wikipedia, we can develop a German corpus which accurately classifies Wikipedia articles into NE categories to within 1% F-score of the state-of-the-art process in English.
Similar resources
Linked hypernyms: Enriching DBpedia with Targeted Hypernym Discovery
The Linked Hypernyms Dataset (LHD) provides entities described by Dutch, English and German Wikipedia articles with types in the DBpedia namespace. The types are extracted from the first sentences of Wikipedia articles using Hearst pattern matching over part-of-speech annotated text and disambiguated to DBpedia concepts. The dataset covers 1.3 million RDF type triples from English Wikipedia, ou...
Wikiwhere: An interactive tool for studying the geographical provenance of Wikipedia references
Wikipedia articles about the same topic in different language editions are built around different sources of information. For example, one can find very different news articles linked as references in the English Wikipedia article titled “Annexation of Crimea by the Russian Federation” than in its German counterpart (determined via Wikipedia’s language links). Some of this difference can of cou...
Interlingual Aspects Of Wikipedia's Quality
This paper presents interim results of an ongoing project on quality issues concerning Wikipedia. One focus of research is the relation of language and quality measurement. The other one is the use of interlingual relations for quality assessment and improvement. The study is based on mono- and multilingual samples of featured and non-featured Wikipedia articles in English, French, German, and It...
Multilingual Vandalism Detection using Language-Independent & Ex Post Facto Evidence - Notebook for PAN at CLEF 2011
There is much literature on Wikipedia vandalism detection. However, this writing addresses two facets given little treatment to date. First, prior efforts emphasize zero-delay detection, classifying edits the moment they are made. If classification can be delayed (e.g., compiling offline distributions), it is possible to leverage ex post facto evidence. This work describes/evaluates several fea...
Multilingual Vandalism Detection Using Language-Independent & Ex Post Facto Evidence
There is much literature on Wikipedia vandalism detection. However, this writing addresses two facets given little treatment to date. First, prior efforts emphasize zero-delay detection, classifying edits the moment they are made. If classification can be delayed (e.g., compiling offline distributions), it is possible to leverage ex post facto evidence. This work describes/evaluates several fea...